
new: Deploy and monitor ML models with GPUs on Amazon EKS #1020

Open
wants to merge 10 commits into
base: main


shivkumr
Contributor

What this PR does / why we need it:

This is a new lab for deploying and monitoring an ML model on Amazon EKS.

Which issue(s) this PR fixes:

First PR for this new lab

Fixes # NA

Quality checks

  • My content adheres to the style guidelines

  • I ran make test module="<module>" and it was successful (see https://github.com/aws-samples/eks-workshop-v2/blob/main/docs/automated_tests.md)

    EKS Workshop
    AI/ML on EKS
    Deploy and Monitor GenAI Model on EKS
    ✔ Deploy and Monitor GenAI Model on EKS (1342913ms)
    ✔ Install Karpenter and KubeRay Operator (211251ms)
    ✔ Install Jupyterhub (444619ms)
    ✔ Model Training (60899ms)
    ✔ Model Inference (312600ms)
    ✔ Monitor GPU Workloads on EKS (4945ms)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@shivkumr shivkumr requested a review from a team as a code owner July 27, 2024 05:18

netlify bot commented Jul 27, 2024

Deploy Preview for eks-workshop ready!

Name Link
🔨 Latest commit c729b67
🔍 Latest deploy log https://app.netlify.com/sites/eks-workshop/deploys/672bb70dae2af5000835519d
😎 Deploy Preview https://deploy-preview-1020--eks-workshop.netlify.app

@shivkumr shivkumr changed the title New Lab AIML lab to deploy and monitor ML model on Amazon EKS New AIML lab to deploy and monitor ML model on Amazon EKS Jul 27, 2024
@bkgardiner bkgardiner self-assigned this Sep 4, 2024
@bkgardiner
Contributor

Karpenter should be preinstalled in this lab, since installing it by hand doesn't add much to the lab. Take a look at the Inference with AWS Inferentia lab (https://www.eksworkshop.com/docs/aiml/inferentia/), which comes with Karpenter preinstalled.

In the Jupyter Notebook commands section, it would be nice to have some explanation of exactly what this code is doing. This would also let the user read through the explanation while waiting for the code to execute.

Other than that this lab looks really good! Thank you for creating it.

Contributor

@bkgardiner bkgardiner left a comment


This looks good to me.

@niallthomson niallthomson changed the title New AIML lab to deploy and monitor ML model on Amazon EKS new: Deploy and monitor ML models with GPUs on Amazon EKS Sep 26, 2024
Contributor

@svennam92 svennam92 left a comment


Awesome! Some comments:

  1. The prepare-environment block should link out to the Terraform, as in other modules.
  2. Please explain the hardware infrastructure being used. Can we outline the Karpenter nodepools that are created for the user? Explain what g5 instances are and why we need them for the lab. Example: https://eksworkshop.com/docs/aiml/chatbot/nodepool
  3. The titles for the AI/ML modules are not distinct enough. How about something like "Training StableDiffusion on NVIDIA GPUs"? Deployment and inference are implied when we're doing training.

@niallthomson niallthomson removed this from the Release 10/25 milestone Nov 1, 2024
4 participants